Automatic Extraction of Generic Web Page Components
Author
Abstract
Information on the World Wide Web is accessed not only visually but also automatically, by systems such as search engines and alternative browsers (e.g. screen readers and voice browsers) that extract and present relevant data from Web pages. In most cases extraction cannot be performed directly, because today's HTML documents lack adequate semantic markup. This thesis proposes a method that converts an HTML document into a semantically enhanced document representation, from which generic document components can be extracted for further knowledge exploration or alternative presentation. The document is parsed, and iteratively smaller nodes are mapped to a classification ontology and then aggregated into larger segments, thereby creating a semantically enhanced parse tree. Segment boundaries are detected based on visual and document segments, such as images and headings. Experimental results of the implementation show that document components, such as headings and menus, can be extracted directly from the semantic parse tree. The heading extraction experiment achieved recall and precision rates of 88% and 91%, respectively. The recall and precision rates for the menu extraction experiment were both 90%.
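The pipeline described in the abstract (parse, classify small nodes, aggregate into segments at heading and image boundaries) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `TAG_CLASSES` table is a hypothetical stand-in for the classification ontology, and the class names, the `segment` function, and the "links only means menu" rule are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-class table standing in for the classification ontology.
TAG_CLASSES = {"h1": "Heading", "h2": "Heading", "h3": "Heading",
               "a": "Link", "p": "Paragraph"}


class LeafClassifier(HTMLParser):
    """Collects (class, text) pairs for the smallest classifiable nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.leaves = []  # classified leaf nodes in document order

    def handle_starttag(self, tag, attrs):
        if tag == "img":  # void element: classify immediately, never stacked
            self.leaves.append(("Image", dict(attrs).get("alt") or ""))
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Classify the text by its innermost enclosing tag that has a class.
        for tag in reversed(self.stack):
            if tag in TAG_CLASSES:
                self.leaves.append((TAG_CLASSES[tag], text))
                return
        self.leaves.append(("Text", text))


def segment(html):
    """Aggregate classified leaves into labelled segments.

    Assumption: a Heading or Image leaf opens a new segment, and a segment
    consisting only of two or more links is labelled as a menu.
    """
    parser = LeafClassifier()
    parser.feed(html)
    parser.close()

    segments = []
    for cls, text in parser.leaves:
        if cls in ("Heading", "Image") or not segments:
            segments.append([])
        segments[-1].append((cls, text))

    labelled = []
    for seg in segments:
        kinds = {cls for cls, _ in seg}
        label = "Menu" if kinds == {"Link"} and len(seg) > 1 else "Section"
        labelled.append((label, seg))
    return labelled
```

With the segments labelled this way, headings and menus can be read off directly, for example by collecting every leaf classified as `Heading` or every segment labelled `Menu`. A real system would replace the tag table with richer visual and structural features.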
Similar resources
Integrating Information Extraction and Automatic Hyperlinking
This paper presents a novel information system integrating advanced information extraction technology and automatic hyper-linking. Extracted entities are mapped into a domain ontology that relates concepts to a selection of hyperlinks. For information extraction, we use SProUT, a generic platform for the development and use of multilingual text processing components. By combining finite-state a...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Unsupervised Structured Data Extraction from Template-generated Web Pages
This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a very wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This encourages industry and academia to seek automatic solutions. To tackle t...
Towards Automatic Structured Web Data Extraction System
Automatic extraction of structured data from web pages is one of the key challenges for Web search engines to advance to a more expressive semantic level. Here we propose a novel data extraction method, called ClustVX. It exploits visual as well as structural features of web page elements to group them into semantically similar clusters. Resulting clusters reflect the page structure and...
Automatic Hidden-Web Table Interpretation by Sibling Page Comparison
The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in whi...
Publication date: 2004